
(CVPR 2018) Frustum PointNets for 3D Object Detection from RGB-D Data

Qi C R, Liu W, Wu C, et al. Frustum pointnets for 3d object detection from rgb-d data[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 918-927.

1. Overview

  • Previous methods focus on images or 3D voxels
  • They treat RGB-D data as 2D maps for CNNs
  • Learning directly in 3D space can better exploit its geometric and topological structure and makes it possible to apply transformations to the points

This paper proposes Frustum PointNets:

  • operate on raw point clouds from RGB-D scans
  • leverage both 2D detection and 3D object localization
  • key challenge. efficiently propose possible locations of 3D objects in 3D space

  • 2D proposal→frustum proposal→segmentation→3D box estimation

  • coordinate normalization

1.1. Related Work

  • Front View Image Based Methods
    represent depth data as 2D maps
  • Bird’s Eye View Based Methods
    MV3D. projects the LiDAR point cloud to bird’s eye view and trains an RPN for 3D bounding box proposals
  • 3D Based Methods
  • Deep Learning on Point Clouds

1.2. Problem Definitions

  • depth data. obtained from LiDAR or indoor depth sensors and represented as a point cloud
  • the projection matrix is known, so a 3D frustum can be obtained from a 2D image region
  • 3D box is parameterized by size (h, w, l), center (c_x, c_y, c_z) and orientation (Θ, φ, ψ).
    Only the heading angle Θ is considered in this paper.
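The box parameterization above can be made concrete by computing the eight corners from size, center and heading. This is a minimal sketch, not the authors' code. It assumes a camera-style frame in which y is the vertical axis, l lies along x, h along y and w along z before rotation, and the center is the geometric center of the box. The function name `box_corners` is hypothetical.

```python
import math

def box_corners(size, center, heading):
    """Return the 8 corners of a 3D box as (x, y, z) tuples.

    size    -- (h, w, l); assumed extents along y, z and x before rotation
    center  -- (c_x, c_y, c_z), assumed to be the geometric box center
    heading -- rotation in radians about the vertical (y) axis
    """
    h, w, l = size
    cx, cy, cz = center
    c, s = math.cos(heading), math.sin(heading)
    corners = []
    for sy in (1, -1):
        for sx, sz in ((1, 1), (1, -1), (-1, -1), (-1, 1)):
            x, y, z = sx * l / 2, sy * h / 2, sz * w / 2
            # Rotate about the y axis, then translate to the box center.
            corners.append((c * x + s * z + cx, y + cy, -s * x + c * z + cz))
    return corners
```

With heading 0 the box is axis-aligned, so `box_corners((2, 2, 4), (0, 0, 0), 0)` spans x in [-2, 2], y in [-1, 1] and z in [-1, 1].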

1.3. Dataset

  • KITTI (outdoor). RGB + LiDAR point cloud (sparse due to distance)
  • SUN-RGBD (indoor). RGB-D (dense)
    the framework is general and handles both sparse and dense point clouds.

2. Frustum PointNets

2.1. Frustum Proposal

The resolution of data produced by most 3D sensors (especially real-time depth sensors) is still lower than that of RGB images from commodity cameras.

  • 2D RGB detector. Fast R-CNN, FPN, focal loss
  • with a known camera projection matrix, a 2D box can be lifted to a frustum
  • rotate. the frustum is rotated so that its center axis is orthogonal to the image plane
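The lifting and rotation steps can be sketched as follows. This is an illustration under assumptions, not the paper's implementation. It assumes a plain pinhole camera with hypothetical intrinsics `fx`, `fy`, `u0`, `v0`, points already in the camera frame (x right, y down, z forward), and a rotation about the vertical axis only.

```python
import math

def points_in_frustum(points, box2d, fx, fy, u0, v0):
    """Keep the points whose pinhole projection falls inside the 2D box."""
    umin, vmin, umax, vmax = box2d
    kept = []
    for x, y, z in points:
        if z <= 0:  # behind the camera, cannot project into the image
            continue
        u, v = fx * x / z + u0, fy * y / z + v0
        if umin <= u <= umax and vmin <= v <= vmax:
            kept.append((x, y, z))
    return kept

def rotate_to_center_view(points, box2d, fx, u0):
    """Rotate points about the y axis so the frustum's center ray maps to +z."""
    umin, _, umax, _ = box2d
    # Horizontal angle between the optical axis and the ray through the box center.
    phi = math.atan2((umin + umax) / 2 - u0, fx)
    c, s = math.cos(phi), math.sin(phi)
    return [(c * x - s * z, y, s * x + c * z) for x, y, z in points]
```

After the rotation, a point on the center ray of the frustum has x = 0, so objects seen at different image positions are presented to the network in a similar pose.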

2.2. 3D Instance Segmentation

2.2.1. V1 PointNet

2.2.2. V2 PointNet++

  • directly regressing the 3D object location from a depth map with a 2D CNN is not easy because of occluding objects and background clutter
  • segmentation (per-point binary classification) in a 3D point cloud is much more natural

  • leverage the semantics from the 2D detector (one-hot class vector)
    the segmentation network can use this prior to find geometries of that category.

  • coordinate normalization. transform the segmented point cloud into a local frame by subtracting the XYZ values of its centroid
  • masking. keep only the frustum points classified as belonging to the object
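Masking and centroid subtraction can be illustrated in a few lines. This is a sketch under assumptions: the per-point object probabilities `probs` come from the segmentation network, and the 0.5 threshold and the function name `mask_and_center` are hypothetical choices for this example.

```python
def mask_and_center(points, probs, threshold=0.5):
    """Keep points predicted as object and move them to a centroid-based frame.

    Returns (centered_points, centroid). The centroid is needed later to
    convert a predicted box center back to frustum coordinates.
    """
    kept = [p for p, q in zip(points, probs) if q > threshold]
    if not kept:  # nothing segmented as object
        return [], (0.0, 0.0, 0.0)
    n = len(kept)
    centroid = tuple(sum(p[i] for p in kept) / n for i in range(3))
    centered = [tuple(p[i] - centroid[i] for i in range(3)) for p in kept]
    return centered, centroid
```

Returning the centroid alongside the centered points keeps the transformation invertible, which the box estimation stage relies on.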

2.3. Amodal 3D Box Estimation

2.3.1. T-Net

  • the origin of the mask coordinate frame may be far from the amodal box center
  • STN (no direct supervision) vs T-Net (explicitly supervised to predict the center residual)

2.3.2. Box Estimation PointNet

V1 PointNet

V2 PointNet++

  • box center residual prediction. combined with the previous center residual from the T-Net and the masked points’ centroid to recover an absolute center

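  • pre-defined NS size templates (3 values each: height, width, length) and NH equally split angle (Θ) bins (NS scores for size, NH scores for heading)

  • output dimension. 3 (center residual) + 4×NS (one score and three size residuals per template) + 2×NH (one score and one angle residual per bin)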
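The 3 + 4×NS + 2×NH output can be decoded into a box as sketched below. The layout of the output vector (center residual, heading scores, heading residuals, size scores, size residuals) and the use of unnormalized residuals are assumptions made for this example. `stage1_center` stands for the masked centroid plus the T-Net residual, and all names are hypothetical.

```python
import math

def decode_box(output, size_templates, num_heading_bins, stage1_center):
    """Turn a flat network output into (center, size, heading)."""
    ns, nh = len(size_templates), num_heading_bins
    assert len(output) == 3 + 4 * ns + 2 * nh
    # Absolute center = box-net residual + (masked centroid + T-Net residual).
    center = tuple(r + c for r, c in zip(output[:3], stage1_center))
    h_scores = output[3:3 + nh]
    h_res = output[3 + nh:3 + 2 * nh]
    s_scores = output[3 + 2 * nh:3 + 2 * nh + ns]
    s_res = output[3 + 2 * nh + ns:]
    hb = max(range(nh), key=lambda k: h_scores[k])  # best heading bin
    sb = max(range(ns), key=lambda k: s_scores[k])  # best size template
    heading = hb * (2 * math.pi / nh) + h_res[hb]
    size = tuple(t + r for t, r in zip(size_templates[sb], s_res[3 * sb:3 * sb + 3]))
    return center, size, heading
```

Classifying a bin or template first and regressing only a small residual keeps the regression targets in a narrow range, which is the motivation for the hybrid formulation.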

2.4. Multi-task Loss

  • L_{c1-reg}. center of T-Net
  • L_{c2-reg}. center of box estimation net
  • L_{h-cls}, L_{h-reg}. heading angle prediction
  • L_{s-cls}, L_{s-reg}. size prediction
  • L_{corner}. corner loss for joint optimization of box parameters
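The terms listed above are combined into a single objective. Written out as in the paper, with L_{seg} the loss of the instance segmentation network from 2.2 (not listed above) and λ, γ weighting factors whose values are not given in these notes:

```latex
L_{multi-task} = L_{seg} + \lambda \left( L_{c1-reg} + L_{c2-reg} + L_{h-cls} + L_{h-reg} + L_{s-cls} + L_{s-reg} + \gamma L_{corner} \right)
```

The paper uses softmax for the classification terms and smooth-L1 (Huber) loss for the regression terms.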

The final 3D box accuracy depends on center, size and heading, which have separate loss terms. They should be optimized jointly, which is what the corner loss does: corner positions are jointly determined by center, size and heading.

  1. for each of the NS × NH anchor boxes (one per size template and heading bin)
  2. only focus on the gt size/heading class
  3. sum the distances between the eight corners of the predicted box and the gt box

To avoid a large penalty from a flipped heading estimate, the corner distances to the flipped gt box (corners P_k^{**}) are also computed, and the minimum of the two is used.
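The corner loss for the single anchor box of the ground-truth size/heading class can be sketched as below. This is an illustration, not the authors' implementation. It assumes a frame with y as the vertical axis, plain Euclidean corner distances, and boxes given as (size, center, heading) tuples. The helper names are hypothetical.

```python
import math

def box_corners(size, center, heading):
    """8 corners of a box; size = (h, w, l), heading about the y axis."""
    h, w, l = size
    cx, cy, cz = center
    c, s = math.cos(heading), math.sin(heading)
    corners = []
    for sy in (1, -1):
        for sx, sz in ((1, 1), (1, -1), (-1, -1), (-1, 1)):
            x, y, z = sx * l / 2, sy * h / 2, sz * w / 2
            corners.append((c * x + s * z + cx, y + cy, -s * x + c * z + cz))
    return corners

def corner_distance(a, b):
    """Sum of distances between corresponding corners of two boxes."""
    return sum(math.dist(p, q) for p, q in zip(a, b))

def corner_loss(pred_box, gt_box):
    """Minimum corner distance to the gt box and to its heading-flipped copy."""
    size_g, center_g, heading_g = gt_box
    pred = box_corners(*pred_box)
    gt = box_corners(size_g, center_g, heading_g)
    # Flipping the heading by pi gives the same box with the corner order reversed.
    gt_flipped = box_corners(size_g, center_g, heading_g + math.pi)
    return min(corner_distance(pred, gt), corner_distance(pred, gt_flipped))
```

A prediction whose heading is off by exactly π has zero loss here, while a prediction shifted by 1 along x with the correct size and heading has loss 8 (one unit per corner).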

3. Experiments

3.1. Comparison

3.2. Ablation Study

3.2.1. 2D vs 3D

  • the 2D region still contains clutter and background

3.2.2. Point Cloud Normalization

  • Frustum rotation and mask centroid subtraction are critical

3.2.3. Loss Function

3.2.4. PointNet Version

3.3. Failure Case

  • inaccurate pose and size estimation in a sparse point cloud (fewer than 5 points)
  • multiple instances from the same category in a frustum
  • 2D detector misses objects due to dark lighting or strong occlusion